Text-to-image model

A text-to-image model is a machine learning model which takes as input a natural language description and produces an image matching that description. Such models began to be developed in the mid-2010s, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's Imagen, and Stability AI's Stable Diffusion, began to approach the quality of real photographs and human-drawn art. Text-to-image models generally combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.
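
As a usage illustration, the following is a minimal sketch of running this two-stage pipeline with a publicly released Stable Diffusion checkpoint via the Hugging Face diffusers library; the model identifier, output filename, and hardware requirements shown are illustrative assumptions, not details from this article.

    # A minimal sketch, assuming the diffusers library and a CUDA-capable GPU.
    import torch
    from diffusers import StableDiffusionPipeline

    # Load a pretrained pipeline bundling the text encoder and the
    # diffusion-based image generator (model identifier is an assumption).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")

    # The prompt is first encoded by the language model; the image model
    # then generates an image conditioned on that representation.
    image = pipe("a stop sign is flying in blue skies").images[0]
    image.save("stop_sign.png")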


History

Before the rise of deep learning, attempts to build text-to-image models were limited to collages assembled from existing component images, such as from a database of clip art. The inverse task, image captioning, was more tractable, and a number of image captioning deep learning models preceded the first text-to-image models. The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously-introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. Images generated by alignDRAW were blurry and not photorealistic, but the model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.

In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task. With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details. Later systems include VQGAN+CLIP, XMC-GAN, and GauGAN2.

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021. A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022, followed by Stable Diffusion, which was publicly released in August 2022. Following other text-to-image models, language model-powered text-to-video platforms such as Runway, Make-A-Video, Imagen Video, Midjourney, and Phenaki can generate video from text and/or text/image prompts.


Architecture and training

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images, and to use one or more auxiliary deep learning models to upscale the output, filling in finer details.

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the previously standard approach.
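
To make the conditional image-generation step concrete, the following is a minimal sketch of a text-conditioned generator in PyTorch; every name, layer, and dimension here is an illustrative assumption rather than a reproduction of any particular published model.

    # A minimal sketch of a conditional generator, assuming PyTorch.
    import torch
    import torch.nn as nn

    class ConditionalGenerator(nn.Module):
        def __init__(self, text_dim=256, noise_dim=100, img_channels=3):
            super().__init__()
            # Project the (text embedding, noise) vector to a small
            # spatial feature map, then upsample it to a 64x64 image.
            self.fc = nn.Linear(text_dim + noise_dim, 128 * 8 * 8)
            self.deconv = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8 -> 16
                nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16 -> 32
                nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1),  # 32 -> 64
                nn.Tanh(),  # pixel values in [-1, 1]
            )

        def forward(self, text_embedding, noise):
            # Conditioning: concatenate the text embedding with the noise.
            x = torch.cat([text_embedding, noise], dim=1)
            x = self.fc(x).view(-1, 128, 8, 8)
            return self.deconv(x)

    # One 64x64 image from a (hypothetical) 256-dimensional text embedding.
    gen = ConditionalGenerator()
    img = gen(torch.randn(1, 256), torch.randn(1, 100))
    print(img.shape)  # torch.Size([1, 3, 64, 64])

A low-resolution output like this would then be handed to the auxiliary upscaling models described above.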


Datasets

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is COCO (Common Objects in Context). Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects, with five captions per image, generated by human annotators. Oxford 102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets, because of their narrow range of subject matter.
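
The following is a minimal sketch of iterating over COCO's (image, captions) pairs with torchvision's CocoCaptions dataset; the local file paths are assumptions, and pycocotools must be installed.

    # A minimal sketch, assuming torchvision, pycocotools, and a local
    # copy of the COCO images and caption annotations (paths are assumed).
    import torchvision.datasets as dset
    import torchvision.transforms as transforms

    coco = dset.CocoCaptions(
        root="coco/train2014",
        annFile="coco/annotations/captions_train2014.json",
        transform=transforms.ToTensor(),
    )

    image, captions = coco[0]
    print(image.shape)   # a [3, H, W] tensor
    print(captions[:2])  # two of the roughly five human-written captions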


Evaluation

Evaluating and comparing the quality of text-to-image models is a challenging problem, and involves assessing multiple desirable properties. As with any generative image model, it is desirable that the generated images be realistic (in the sense of appearing as if they could plausibly have come from the training set) and diverse in their style. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement.

A common algorithmic metric for assessing image quality and diversity is the Inception score (IS), which is based on the distribution of labels predicted by a pretrained Inception v3 image classification model when applied to a sample of images generated by the text-to-image model. The score is increased when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance (FID), which compares the distribution of generated images and real training images, according to features extracted by one of the final layers of a pretrained image classification model.
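
Concretely, IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is the classifier's label distribution for a generated image x and p(y) is the marginal over the sample; FID = ||μ_r − μ_g||^2 + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)), where μ and Σ are the mean and covariance of classifier features for real (r) and generated (g) images. The following is a minimal sketch of both metrics, assuming NumPy and SciPy, with random stand-in inputs where real use would supply Inception v3 outputs.

    # A minimal sketch of IS and FID from precomputed arrays, assuming
    # NumPy and SciPy; real inputs would come from Inception v3.
    import numpy as np
    from scipy.linalg import sqrtm

    def inception_score(p_yx, eps=1e-12):
        # p_yx: (n_images, n_classes) softmax outputs of the classifier.
        p_y = p_yx.mean(axis=0, keepdims=True)  # marginal label distribution
        kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
        return float(np.exp(kl.mean()))

    def frechet_inception_distance(feats_real, feats_gen):
        # feats_*: (n_images, n_features) late-layer classifier features.
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_g = np.cov(feats_gen, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g).real  # drop tiny imaginary residue
        return float(((mu_r - mu_g) ** 2).sum()
                     + np.trace(cov_r + cov_g - 2.0 * covmean))

    # Random stand-in data: Dirichlet rows are valid label distributions.
    rng = np.random.default_rng(0)
    print(inception_score(rng.dirichlet(np.ones(10), size=100)))
    print(frechet_inception_distance(rng.normal(size=(100, 64)),
                                     rng.normal(size=(100, 64))))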


See also

* Artificial intelligence art

